
Welcome back to deep learning. So today we want to go deeper into reinforcement learning, and the concepts that we want to explain today are policy iteration and how to construct better policies towards designing strategies for winning games.

So let's have a look at the slides that I have here for you. So it's the third part of our lecture and we want to talk about policy iteration.

Previously, we had the action value function, which could assess the value of an action.

Of course, this also has to depend on the state s_t, and it is essentially our oracle, you could say, that tries to predict the future return G_t when following a certain policy, depending on both the action and the state.
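
Sketched in the standard notation (an assumption about the exact symbols, which may differ slightly from the slides), the action value function under a policy π is the expected return G_t when taking action a_t in state s_t and following π afterwards:

```latex
q_\pi(s_t, a_t) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s_t,\, A_t = a_t \right]
```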

Now, we can also find an alternative formulation here, and we introduce the state value function. So previously we had the action value function that told us how valuable a certain action is.

And now we want to introduce the state value function that tells us how valuable a certain state is.

And here you can see that it is formalized in a very similar way. Again, we have an expected value of our future return. And this is now of course dependent on the state.

So we drop the dependency on the action and only focus on the state. And you can now see that this is the expected value of the future return conditioned on the state.

So we want to marginalize out the actions. We don't care what the influence of the action is; we just want to figure out what the value of a certain state is.
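
In the same standard notation (again a sketch, not a transcription of the slide), the state value function is the expected return conditioned only on the state, which is the action value function with the actions marginalized out under the policy:

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
         = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)
```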

And we can actually compute this. So we can also do this for our grid example. If you recall, we had the simple game where A and B were special locations on the grid that would teleport you to A' and B'.

And once you arrive at A' or B', you get a reward: for A' it is +10 and for B' it is +5. Whenever you try to leave the board, you get a negative reward.

And now we can play this game and compute the state value function. And of course we can do this under the uniform random policy, because then we don't have to know anything about the game.

So we can simply follow the uniform random policy: we choose actions at random, play the game for a certain time, and then we are able to compute the state values according to the previous definition.
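
To make this concrete, here is a minimal sketch of iterative policy evaluation for such a gridworld in Python. The exact grid layout, the discount factor of 0.9, and the -1 penalty for bumping into the border are assumptions based on the standard 5x5 example; with these choices the sketch reproduces the numbers quoted below.

```python
import numpy as np

# Assumed gridworld layout: 5x5 grid, A=(0,1) teleports to A'=(4,1) with +10,
# B=(0,3) teleports to B'=(2,3) with +5, trying to leave the board gives -1.
N = 5
GAMMA = 0.9
A, A_PRIME, R_A = (0, 1), (4, 1), 10.0
B, B_PRIME, R_B = (0, 3), (2, 3), 5.0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic dynamics of the gridworld: return (next_state, reward)."""
    if state == A:           # any action taken in A teleports to A' with +10
        return A_PRIME, R_A
    if state == B:           # any action taken in B teleports to B' with +5
        return B_PRIME, R_B
    nxt = (state[0] + action[0], state[1] + action[1])
    if not (0 <= nxt[0] < N and 0 <= nxt[1] < N):
        return state, -1.0   # trying to leave the board: stay put, reward -1
    return nxt, 0.0

# Iterative policy evaluation under the uniform random policy pi(a|s) = 1/4.
V = np.zeros((N, N))
for _ in range(1000):
    V_new = np.zeros_like(V)
    for i in range(N):
        for j in range(N):
            for a in ACTIONS:
                (ni, nj), r = step((i, j), a)
                V_new[i, j] += 0.25 * (r + GAMMA * V[ni, nj])
    V = V_new

print(np.round(V, 1))  # the tile of A evaluates to about +8.8, the tile of B to about +5.3
```

Instead of iterating the Bellman expectation backup as in this sketch, one could also estimate the same values by actually playing many episodes with random actions and averaging the observed returns, as described in the lecture.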

And we can see that the edge tiles, in particular at the bottom, even have a negative value. Of course they can have negative values: if you are in the corner tiles at the bottom left and bottom right, where we find -1.9 and -2.0, there is a 50% likelihood that you will try to leave the grid, and in these two directions you will of course generate a negative reward.

So you can see that we have states that are negative and then we have states that are much more valuable.

And you can see that if you look at the positions where A and B are located, they have a very high value: the tile with A has an expected future return of 8.8 and the tile with B has an expected future return of 5.3.

So these are really good states. With this state value function we have learned something about our game, so you could say: okay, maybe we can use this and apply greedy action selection on these state values.

So let's define a policy, and this policy now always selects the action that leads into the neighboring state of highest value.

If you do so, you have a new policy. And this new, greedy policy is a better policy than the one you started with.
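
Written as a formula (a sketch in standard notation, with p(s', r | s, a) denoting the transition dynamics), this greedy improvement step is:

```latex
\pi'(s) = \arg\max_{a} \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma\, v_\pi(s') \right]
```

The policy improvement theorem then guarantees that v_{π'}(s) ≥ v_π(s) for all states, which is exactly the sense in which the new policy is better.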

So we can now relate this to the action value function that we used before, since we introduced the state value function in a similar role.

So we can now see that we can introduce this action value function q_π(s, a), which depends on the state and the action.

And this then also accounts for the transition probabilities. So you can now compute q_π(s, a) as the expected value of the future return, given the state and the action.

And you can compute this in a similar way. You then get an expected future return for every state and for every action.
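
Sketched in standard notation, this relation between the action value function and the state value function reads:

```latex
q_\pi(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\, A_t = a \right]
            = \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma\, v_\pi(s') \right]
```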

Are all of these value functions created equal? No, there can only be one optimal state value function.

And we can show its existence without referring to a specific policy. The optimal state value function is simply the maximum over the state value functions of all policies, i.e., the one produced by the best policy.

So the best policy will always produce the optimal state value function.
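
As a formula (again a sketch in standard notation):

```latex
v_*(s) = \max_{\pi} v_\pi(s) \quad \text{for all } s \in \mathcal{S}
```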

Now we can also define the optimal action value function, and this can be related to our optimal state value function.

And we can see that the optimal action value function is given as the expected reward in the next step plus our discount factor times the optimal state value function of the next state.

So if we know the optimal state value function, then we can also derive the optimal action value function. So they are related to each other.
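
In standard notation, this relation can be sketched as:

```latex
q_*(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\, A_t = a \right]
```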

So this was the state value function for the uniform random policy. And I can now show you the optimal one, V*, the optimal state value function.

And you see that this has much higher values, of course, because we have been optimizing for this.

And you also observe that the optimal state value function is strictly positive because we are in a deterministic setting here.

So, a very important observation: in this deterministic setting, the optimal state value function will be strictly positive.

Now we can also order policies. We have to determine what a better policy is, and we can order them with the following concept.

So a policy pi is better than (or equal to) a policy pi prime if and only if the state values of pi are at least as high as the state values that you obtain with pi prime, for all states in the set of states.

With this ordering, any policy that achieves the optimal state value function is an optimal policy.
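
Sketched in standard notation, this partial ordering of policies is:

```latex
\pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s) \quad \text{for all } s \in \mathcal{S}
```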

So you see there is only one optimal state value function, but there may be more than one optimal policy.

So there could be two or three different policies that result in the same optimal state value function.

So if you know either the optimal state value or the optimal action value function, then you can directly obtain an optimal policy by greedy action selection.

So if you know the optimal state values and you have complete knowledge about all the actions and the transition dynamics, then you can always get an optimal policy by greedy action selection.
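
As a formula (a sketch in standard notation), greedy action selection on the optimal value functions reads:

```latex
\pi_*(s) = \arg\max_{a} q_*(s, a) = \arg\max_{a} \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma\, v_*(s') \right]
```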

So let's have a look at how this actually turns out in terms of the policy.

Now greedy action selection on the optimal state value function or the optimal action value function would lead to the optimal policy.

What you see here on the left-hand side is greedy action selection on the state value function of the uniform random policy.

Using what we computed earlier in this video, you can of course choose your action such that the next state is a state of higher value, and you end up with this kind of policy.

Now, if you do the same thing on the optimal state value function, you can see that we essentially arrive at a very similar policy.

You see a couple of differences. In fact, you don't always have to move up as shown on the left-hand side.

So on several occasions you can also move left or up, and at each of the squares indicated with multiple arrows you can choose among the indicated actions with equal probability.

So if there's an up and left arrow, you can choose either one action or the other and you would still have an optimal policy.

So this would be the optimal policy that is created by greedy action selection on the optimal state value function.

Part of a video series:

Accessible via: Open access

Duration: 00:17:39 min

Recording date: 2020-10-12

Uploaded on: 2020-10-12 20:56:20

Language: en-US

Deep Learning - Reinforcement Learning Part 3

This video explains a fundamental learning technique in reinforcement learning: Policy Iteration.

For reminders to watch the new video, follow on Twitter or LinkedIn.

Further Reading:
A gentle Introduction to Deep Learning
